import pickle
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
#df = pd.read_csv('./data/dataset_open_food.csv', delimiter='\t')
Comments:
Here we start our exploratory data analysis of the Open Food Facts database.
The first step of our study is to compute the rate of missing values per column and the distribution of the different columns.
We can already describe the dataset by computing the mean, standard deviation, and min-max values of the numerical columns.
df.describe()
plt.figure(figsize=(10, 50))
plt.rcParams['axes.facecolor'] = 'white'
(df.isnull().mean(axis=0)*100).plot.barh()
plt.xlim(right=100)
plt.title("Missing values rate")
plt.xlabel("percentage")
plt.show();
Comments:
Furthermore, there are many missing values in the dataset. Some columns are nearly empty, for example the composition columns ("cocoa_100g", "zinc_100g", etc.). This is easily explained by the fact that it is impossible to print a product's complete composition on its label.
We could choose to discard features with more than 60% missing values (for instance), because such sparse features are unreliable. However, we do not want to do that, because in the following parts we are going to discuss some specific aspects of our food consumption:
These are some negative aspects of our food consumption, and we are going to study them in relation to the French nutrition grade (now given in many countries). Specifically, we are going to compare the worldwide and US consumption on these aspects.
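Had we applied such a missing-value threshold, dropping the sparse columns is a two-liner; a minimal sketch on a toy frame (the toy data stands in for the real Open Food Facts dump, which is not loaded here):

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the Open Food Facts dump
toy = pd.DataFrame({
    "product_name": ["a", "b", "c", "d", "e"],
    "cocoa_100g": [np.nan, np.nan, np.nan, np.nan, 12.0],  # 80% missing
    "sugars_100g": [1.0, 2.0, np.nan, 4.0, 5.0],           # 20% missing
})

threshold = 0.60
missing_rate = toy.isnull().mean()            # fraction of NaNs per column
kept = toy.loc[:, missing_rate <= threshold]  # keep columns at or below the threshold
print(list(kept.columns))  # ['product_name', 'sugars_100g'] — cocoa_100g is discarded
```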
partb = pd.read_pickle('./data/partb.pkl')
In this part, we are going to discuss the effect of dangerous food additives on a product's nutrition grade, and consequently on our everyday food consumption.
The nutrition grade ranges from "a" to "e" (best to worst). We want to know whether the food products with dangerous additives, or with the most additives, have good or bad nutrition grades.
Hence, in order to determine which additives are harmful, we first use the list published on the "Hungry for Change" website. It contains the most dangerous food additives that can be found on the market but are not forbidden, and it was compiled from several scientific studies on the matter.
Source: http://www.hungryforchange.tv/article/top-10-food-additives-to-avoid
additives_pickle = pd.read_pickle('./data/additives.pkl')
additives_new = additives_pickle.additives_tags.dropna().map(lambda x : x.lower())
# Keep only the rows whose additive tags actually contain e250
# (label-based .loc: the mask's index holds labels from the original frame, not positions)
additives_e250_index = additives_new[additives_new.str.contains('e250')].index
additives_nutrition = additives_pickle.nutrition_grade_fr
additives_both = additives_pickle.loc[additives_e250_index]
additives_both[additives_both.nutrition_grade_fr.notnull()].sample(5)
order = ['a', 'b', 'c', 'd', 'e']

def grade_dist(title, df_dist):
    fig = df_dist.apply(pd.value_counts).loc[order].plot(kind='bar', subplots=True)
    plt.rcParams['axes.facecolor'] = 'white'
    plt.title(title)
    plt.xlabel("Nutrition Grade")
    plt.ylabel("Number of products")
    plt.show()
additives_result_e250 = pd.DataFrame(additives_both.nutrition_grade_fr)
grade_dist("Grade Distribution for E250 additive in the World", additives_result_e250)
Comment:
Here we do find the e250 additive (used for curing/preserving meat and fish products), which can cause acute methemoglobinemia (haemoglobin loses its ability to carry oxygen), irritability, lack of energy, headaches, brain damage, or even death in severe untreated cases. What is alarming is that many products contain this additive yet keep an "a" or "b" nutrition grade.
Now, we are going to run the same analysis with the full list given by the "Hungry for Change" website, on the world dataset and on the US-specific dataset.
list_dangerous_additives = ['e951','e621', 'e133', 'e124','e110', 'e102', 'e221', 'e320', 'e220']
kwstr = '|'.join(list_dangerous_additives)
mask = additives_new.str.contains(kwstr)  # products containing at least one dangerous E-number
df_additives = additives_new[mask]
additives_all = additives_pickle.loc[df_additives.index]  # label-based, not .iloc
additives_result_all = pd.DataFrame(additives_all.nutrition_grade_fr)
grade_dist("Grade Distribution for dangerous additives in the World", additives_result_all)
partb_new_null_additives = partb.additives_tags.isnull()
df_additives_null = partb[partb_new_null_additives]  # boolean mask, not .iloc over its index
df_result_additives_null = pd.DataFrame(df_additives_null.nutrition_grade_fr)
grade_dist("Grade Distribution for Non-Additives in the World", df_result_additives_null)
partb.countries = partb.countries.str.lower()
# Fix names that appear under multiple spellings by mapping them to one canonical form
country_fixes = {
    'en:fr': 'france',
    'en:es': 'spain',
    'españa': 'spain',
    'en:gb': 'united kingdom',
    'en:uk': 'united kingdom',
    'us': 'united states',
    'en:us': 'united states',
    'usa': 'united states',
    'en:cn': 'canada',
    'en:au': 'australia',
    'en:de': 'germany',
    'deutschland': 'germany',
    'en:be': 'belgium',
    'en:ma': 'morocco',
    'en:ch': 'switzerland',
}
partb.countries = partb.countries.replace(country_fixes)
## US plot ##
us = ['united states']
partb_us = partb[partb.countries.isin(us)]
partb_additives_us = partb_us.additives_tags.dropna().map(lambda x: x.lower())
# Keep only US products whose additive tags contain one of the dangerous E-numbers
mask_us = partb_additives_us.str.contains(kwstr)
additives_us = partb.loc[partb_additives_us[mask_us].index]
result_additives_us = pd.DataFrame(additives_us.nutrition_grade_fr)
grade_dist("Grade Distribution for dangerous additives in the US", result_additives_us)
partb_new_us_null_additives = partb_us.additives_tags.isnull()
df_us_additives_null = partb_us[partb_new_us_null_additives]  # boolean mask on the US subset
df_result_us_additives_null = pd.DataFrame(df_us_additives_null.nutrition_grade_fr)
grade_dist("Grade Distribution for Non-Additives in the US",df_result_us_additives_null)
Comment: As the two plots above show, most of the products (in both the world and the US) have a nutrition grade of "d", which is a pretty bad grade but not the worst. We can now start discussing the impact of dangerous additives on our food consumption.
countries = ['france','united kingdom','spain','germany','united states','australia','canada', 'belgium', 'morocco', 'switzerland']
df_countries = partb[partb.countries.isin(countries)]
df_countries_additives = df_countries[df_countries.additives_n.notnull()]
df_groupedby_countries_additives = (df_countries_additives
                                    .groupby('countries')['additives_n']
                                    .mean()
                                    .reset_index())
# Sort countries by descending average additive count
df_sorted = df_groupedby_countries_additives.sort_values('additives_n', ascending=False)
# Plot the average number of additives per country
fig = plt.figure(figsize=(15, 8))
ax1 = fig.add_subplot(1, 1, 1)
y_pos = np.arange(len(df_sorted))
# Make a barplot
plt.bar(y_pos, df_sorted['additives_n'], align='center')
plt.title('Average number of additives per product by country')
plt.xticks(y_pos, df_sorted['countries'])
plt.ylabel('Average number of additives')
plt.show()
import plotly.express as px
import country_converter as coco
cc = coco.CountryConverter()
# Create the ISO-3 country codes in order to plot our map in plotly
df_groupedby_countries_additives['iso_code'] = df_groupedby_countries_additives['countries'].apply(
    lambda x: coco.convert(names=x, to='ISO3'))
fig = px.choropleth(df_groupedby_countries_additives, locations="iso_code",
                    color="additives_n",  # additives_n is a column of our dataframe
                    hover_name="countries",  # column to add to hover information
                    color_continuous_scale=px.colors.sequential.Plasma)
fig.show()
Comments:
Here we rank 10 countries from our dataset in order to identify which ones use the largest average number of additives per product.
This ranking should reflect the food consumption habits and additive use of each country's population.
It can be clearly seen that Belgium and Australia use the most additives among these countries, and Spain and Germany the least. This also points to the healthiest and unhealthiest additive habits among these countries.
def prod_dist(title, feature, xlabel):
    plt.figure(figsize=(12, 10))
    ax = partb[partb[feature] > 0.0][feature].value_counts().plot(kind='bar', color='cadetblue')
    plt.title(title, fontsize=14)
    plt.xlabel(xlabel)
    plt.ylabel('Number of Products')
    plt.show()
prod_dist('Distribution of the number of additives by products','additives_n', 'Additives')
Comment:
This plot shows the distribution of the number of additives per product. It can be clearly seen that most products use 1 to 5 additives in their composition. What is astonishing, however, is that we find products with more than 15 additives in total! Based on our research, the types of additives used are regulated and have to be approved prior to use, but there is no limit on the number of additives that can be added to a food product.
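Flagging those outliers does not need a plot; a quick sketch with a toy additive-count series (hypothetical values standing in for `partb["additives_n"]`):

```python
import pandas as pd

# Toy additive counts standing in for partb["additives_n"] (hypothetical values)
additives_n = pd.Series([0, 1, 3, 5, 16, 2, 18])

# Flag the outliers: products with more than 15 additives
heavy = additives_n[additives_n > 15]
print(len(heavy), heavy.max())  # 2 18
```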
import matplotlib.patches as mpatches
additives = additives_pickle['additives_en'].str.extractall(r"(?P<Count>E\d{3}\w?)")
additives_count = additives.apply(pd.value_counts).head(20)
additives_count['additives_num'] = additives_count.index
additives_count.reset_index(drop=True,inplace=True)
additives_mapping = {'E330': 'black','E322':'red','E322i':'red','E101':'blue','E375':'black','E101i':'blue',
'E300':'yellow','E415':'red','E412':'black','E500':'black','E471':'red','E203':'green','E407':'red',
'E440':'red','E250':'green','E150a':'blue','E450':'black','E500i':'blue','E331':'black',
'E129':'black','E339':'black','E440i':'red','E160a':'blue','E270':'black','E102':'blue',
'E410':'red','E133':'blue','E341':'black','E428':'red','E621':'black','E202':'blue'}
additives_count['colors'] = additives_count['additives_num'].map(additives_mapping)
ax = additives_count.plot(x='additives_num',y='Count',kind='barh',color=additives_count['colors'],figsize=(15,10))
ax.invert_yaxis()
ax.legend().set_visible(False)
ax.set_title('20 most used additives in food products')
colors = mpatches.Patch(color='blue', label='food colouring')
others = mpatches.Patch(color='black', label='other')
emulsifiers = mpatches.Patch(color='red', label='emulsifiers')
antioxidant = mpatches.Patch(color='yellow', label='antioxidants')
preservatives = mpatches.Patch(color='green', label='food preservatives')
plt.legend(handles=[colors,others,emulsifiers,antioxidant,preservatives])
plt.xlabel('Number of products')
plt.ylabel('types of additives')
plt.show();
Comment:
This is the ranking of the 20 most used additives by number of products, with the additive classes distinguished by color.
The list of dangerous additives and their classes was taken from the World Health Organization and the FDA. As you can see above, the most dangerous and popular additives in food products are the "emulsifiers" (and "other") additives, present in more than 100,000 products in the dataset.
Sources:
https://www.fda.gov/food/food-additives-petitions/food-additive-status-list
https://www.who.int/en/news-room/fact-sheets/detail/food-additives
Comment:
In this part, we are going to study the impact of allergens on the nutrition grade.
More specifically, we will also consider the US case separately in order to compare it to the rest of the world.
df_us_allergens = partb_us[['nutrition_grade_fr', 'allergens']]
df_allergens_notnull = partb.allergens.dropna()
print('Shape of the US dataframe: ', partb_us.shape)
print('Number of allergen descriptions in the whole dataframe: ', len(df_allergens_notnull.index))
print('Number of allergen descriptions in the US dataframe: ', len(df_us_allergens.allergens.dropna().index))
partb_new_allergens = partb.allergens.dropna().map(lambda x : x.lower())
df_allergens = partb.loc[partb_new_allergens.index]  # label-based selection, not .iloc
df_result_allergens = pd.DataFrame(df_allergens.nutrition_grade_fr)
grade_dist("Grade Distribution for Allergens in the World", df_result_allergens)
partb_new_null_allergens = partb.allergens.isnull()
df_allergens_null = partb[partb_new_null_allergens]  # boolean mask
df_result_allergens_null = pd.DataFrame(df_allergens_null.nutrition_grade_fr)
grade_dist("Grade Distribution for Non-Allergens in the World", df_result_allergens_null)
partb_new_allergens_us = partb_us.allergens.dropna().map(lambda x : x.lower())
df_us_allergens = partb.loc[partb_new_allergens_us.index]  # label-based selection
df_result_us_allergens = pd.DataFrame(df_us_allergens.nutrition_grade_fr)
grade_dist("Grade Distribution for Allergens in the US",df_result_us_allergens)
partb_new_us_null_allergens = partb_us.allergens.isnull()
df_us_allergens_null = partb_us[partb_new_us_null_allergens]  # boolean mask on the US subset
df_result_us_allergens_null = pd.DataFrame(df_us_allergens_null.nutrition_grade_fr)
grade_dist("Grade Distribution for Non-Allergens in the US",df_result_us_allergens_null)
Comment: In this part, we see that for the rest of the world, allergens are concentrated in the worst grades (similar to additives).
However, the distribution of allergens on the US market shows a total imbalance across the nutrition grades: we have the same number of products with grade "b" and grade "d". This imbalance can be explained by the low number of products with "allergens" labels on them (845 for the US versus 82895 for the rest of the world). Hence, we conclude that the impact of allergens on US products cannot be established.
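One way to make such an imbalance explicit despite the very different sample sizes is to compare normalized grade distributions instead of raw counts; a sketch with made-up grade lists (not the real dataset figures):

```python
import pandas as pd

order = ["a", "b", "c", "d", "e"]
# Hypothetical grade samples for illustration (not the real dataset figures)
world = pd.Series(["d"] * 50 + ["e"] * 30 + ["a"] * 20)
us = pd.Series(["b"] * 10 + ["d"] * 10)

# normalize=True turns counts into shares, so samples of different sizes are comparable
world_dist = world.value_counts(normalize=True).reindex(order, fill_value=0)
us_dist = us.value_counts(normalize=True).reindex(order, fill_value=0)
print(us_dist.to_dict())  # "b" and "d" tie at 0.5 each
```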
Comments:
Finally, we are going to study the effect of palm oil ingredients (or ingredients that may be from palm oil) on food consumption.
palm_list = ['ingredients_from_palm_oil_n',
'ingredients_from_palm_oil',
'ingredients_from_palm_oil_tags',
'ingredients_that_may_be_from_palm_oil_n',
'ingredients_that_may_be_from_palm_oil',
'ingredients_that_may_be_from_palm_oil_tags','nutrition_grade_fr']
df_palm = partb[palm_list]
df_us_palm = partb_us[palm_list]
print('Number of palm oil products in the whole dataframe: ', len(df_palm.ingredients_from_palm_oil_n.dropna().index))
print('Number of palm oil products in the US dataframe: ', len(df_us_palm.ingredients_from_palm_oil_n.dropna().index))
prod_dist('Distribution of the number of Palm Oil Ingredients by products','ingredients_from_palm_oil_n', 'Palm oil Ingredients' )
Comments:
Here we wanted to see the distribution of the number of palm oil ingredients per product (in the world dataset). Most products contain only one ingredient derived from palm oil, but we do find food products with 3 palm-oil-based ingredients, which is inevitably bad for your health.
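To relate the palm-oil ingredient count directly to the nutrition grade, a `pd.crosstab` gives a compact contingency table; a sketch on toy rows (hypothetical data, column names mirror the real ones):

```python
import pandas as pd

# Toy rows standing in for the real dataset: palm-oil ingredient count vs. grade
df_toy = pd.DataFrame({
    "ingredients_from_palm_oil_n": [1, 1, 2, 3, 1, 2],
    "nutrition_grade_fr": ["d", "e", "e", "e", "c", "d"],
})

# Cross-tabulate: rows = number of palm-oil ingredients, columns = grade
table = pd.crosstab(df_toy["ingredients_from_palm_oil_n"],
                    df_toy["nutrition_grade_fr"])
print(table)
```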
palm_oil_tags_new = partb.ingredients_from_palm_oil_tags.dropna().map(lambda x : x.lower())
df_palm_ingredients = partb.loc[palm_oil_tags_new.index]  # label-based selection, not .iloc
df_result_palm = pd.DataFrame(df_palm_ingredients.nutrition_grade_fr)
grade_dist("Grade Distribution for Palm Oil Products in the World",df_result_palm )
partb_new_null_palm_oil_tags = partb.ingredients_from_palm_oil_tags.isnull()
df_palm_oil_tags_null = partb[partb_new_null_palm_oil_tags]  # boolean mask
df_result_palm_oil_tags_null = pd.DataFrame(df_palm_oil_tags_null.nutrition_grade_fr)
grade_dist("Grade Distribution for Non-Palm Oil Products in the World", df_result_palm_oil_tags_null)
palm_oil_tags_new_us = partb_us.ingredients_from_palm_oil_tags.dropna().map(lambda x : x.lower())
df_palm_us_ingredients = partb.loc[palm_oil_tags_new_us.index]  # label-based selection
df_result_us_palm = pd.DataFrame(df_palm_us_ingredients.nutrition_grade_fr)
grade_dist("Grade Distribution for Palm Oil Products in the US", df_result_us_palm)
print('The number of products studied in the US Palm Oil Dataframe is only : ', len(df_palm_us_ingredients.index))
# Create a column with the total number of confirmed or suspected palm oil ingredients per product
df_countries = df_countries.copy()  # avoid SettingWithCopyWarning on the slice
df_countries['palm_oil_n'] = df_countries['ingredients_from_palm_oil_n'] + df_countries['ingredients_that_may_be_from_palm_oil_n']
df_countries_palm = df_countries[df_countries.palm_oil_n.notnull()]
df_groupedby_countries_palm = (df_countries_palm
                               .groupby('countries')['palm_oil_n']
                               .mean()
                               .reset_index())
# Sort countries by descending average palm oil ingredient count
df_palm_sorted = df_groupedby_countries_palm.sort_values('palm_oil_n', ascending=False)
# Plot the average number of palm oil ingredients per country
fig = plt.figure(figsize=(15, 8))
ax1 = fig.add_subplot(1, 1, 1)
y_pos = np.arange(len(df_palm_sorted))
# Make a barplot
plt.bar(y_pos, df_palm_sorted['palm_oil_n'], align='center')
plt.title('Average number of Palm oil ingredients per product by country')
plt.xticks(y_pos, df_palm_sorted['countries'])
plt.ylabel('Average number of Palm oil ingredients')
plt.show()
# Create the ISO-3 country codes in order to plot our map in plotly
df_groupedby_countries_palm['iso_code'] = df_groupedby_countries_palm['countries'].apply(
    lambda x: coco.convert(names=x, to='ISO3'))
fig = px.choropleth(df_groupedby_countries_palm, locations="iso_code",
                    color="palm_oil_n",  # palm_oil_n is a column of our dataframe
                    hover_name="countries",  # column to add to hover information
                    color_continuous_scale=px.colors.sequential.Plasma)
fig.show()
Comments:
To conclude, we wanted to see the distribution of food products containing palm oil ingredients with respect to the nutrition grade.
Unsurprisingly, in the world dataset the number of products increases gradually towards the negative grades (the largest number of products have the worst grade, "e").
However, we cannot conclude anything about the US dataset, because only 3 products carry the "palm oil" tag on their labels. This can be explained by the fact that many US lobbies want their products to continue using palm oil without informing the public (source: "The Guardian").
Source: https://www.theguardian.com/news/2019/feb/19/palm-oil-ingredient-biscuits-shampoo-environmental
sns.set(style="white")
# List of features whose correlations we will study
list_additives_palm = ['nutrition_grade_fr', 'additives_n', 'ingredients_from_palm_oil_n', 'ingredients_that_may_be_from_palm_oil_n']

# Correlation matrix function specific to our study
def corr_matrix(df_test, title):
    fig = plt.figure(figsize=(20, 20))
    df_test_new = df_test[list_additives_palm]
    df_test_new = pd.get_dummies(df_test_new, columns=['nutrition_grade_fr'])
    # Draw into the figure we just created instead of letting matshow open a new one
    plt.matshow(df_test_new.corr(), fignum=fig.number)
    plt.title(title)
    plt.show()

corr_matrix(partb, 'Nutrition Grade Correlation Matrix on the World')
corr_matrix(partb_us, 'Nutrition Grade Correlation Matrix on the US')
Here we plot the correlation matrices summarizing our findings in the 3 previous parts: we only find a correlation between the additives and palm oil features (columns 0 and 2).
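Since the matshow panels carry no tick labels, the same information is easier to read as a sorted table of pairwise correlations; a sketch on synthetic data (the column names mirror the real ones, and the co-variation is constructed, not measured):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
palm = rng.integers(0, 3, size=n)
# additives_n built to co-vary with the palm-oil count, mimicking the observed link
toy = pd.DataFrame({
    "additives_n": palm * 2 + rng.integers(0, 2, size=n),
    "ingredients_from_palm_oil_n": palm,
})

corr = toy.corr()
# Keep only the upper triangle so each pair appears once, then sort by strength
pairs = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1)).stack()
print(pairs.sort_values(ascending=False))
```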